3 research outputs found

    On the performance of phonetic algorithms in microtext normalization

    User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, which makes them very difficult for traditional intelligent systems to process. In response, microtext normalization transforms those non-standard microtexts into standard, well-written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non-standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so-called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that, taking as input non-standard terms to be normalized, produce as output the smallest possible sets of normalization candidates that still contain the corresponding target standard words. As will be shown, the choice of the phonetic algorithm depends heavily on the capabilities of the candidate selection mechanism usually found at the end of a microtext normalization pipeline. The faster it can make the right choices among big enough sets of candidates, the more we can sacrifice on the precision of the phonetic algorithms in favour of coverage in order to increase the overall performance of the normalization system.

    Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-1-R
    Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-2-R
    Ministerio de Economía y Competitividad | Ref. FFI2014-51978-C2-1-R
    Ministerio de Economía y Competitividad | Ref. FFI2014-51978-C2-2-R
    Xunta de Galicia | Ref. ED431D-2017/12
    Xunta de Galicia | Ref. ED431B2017/01
    Xunta de Galicia | Ref. ED431D R2016/046
    Ministerio de Economía y Competitividad | Ref. BES-2015-07376
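
    As an illustration of phonetic candidate generation as described in the abstract above, here is a minimal sketch in Python. It assumes classic American Soundex as the phonetic algorithm; the listing does not say which algorithms the study actually evaluates, so this choice is only an example. The function names (`soundex`, `build_index`, `candidates`) and the toy vocabulary are hypothetical.

```python
from collections import defaultdict

# Classic Soundex consonant classes (assumption: Soundex is used here only
# as a familiar example of a phonetic algorithm).
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex key of `word` ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            if digit != prev:      # collapse runs of the same code
                key += digit
            prev = digit
        elif ch not in "hw":       # vowels (and y) break a run; h and w do not
            prev = ""
        if len(key) == 4:
            break
    return key.ljust(4, "0")


def build_index(vocabulary):
    """Group standard words by phonetic key."""
    index = defaultdict(set)
    for word in vocabulary:
        index[soundex(word)].add(word)
    return index


def candidates(term, index):
    """Standard words sharing the phonetic key of a non-standard term."""
    return sorted(index.get(soundex(term), set()))


# Toy vocabulary, hypothetical.
VOCAB = ["love", "live", "leave", "tomorrow", "tomato", "night", "note"]
INDEX = build_index(VOCAB)
```

    With this toy vocabulary, `candidates("luv", INDEX)` contains the target `love` but also `live` and `leave`. This is the coverage-versus-precision trade-off the abstract refers to: a coarser key keeps the target in the set at the cost of a larger set for the candidate selection step to disambiguate.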

    Early stopping by correlating online indicators in neural networks

    In order to minimize the generalization error in neural networks, a novel technique to identify overfitting phenomena when training the learner is formally introduced. This enables support of a reliable and trustworthy early stopping condition, thus improving the predictive power of that type of modeling. Our proposal exploits the correlation over time in a collection of online indicators, namely characteristic functions for indicating if a set of hypotheses are met, associated with a range of independent stopping conditions built from a canary judgment to evaluate the presence of overfitting. That way, we provide a formal basis for decision making in terms of interrupting the learning process. As opposed to previous approaches focused on a single criterion, we take advantage of subsidiarities between independent assessments, thus seeking both a wider operating range and greater diagnostic reliability. With a view to illustrating the effectiveness of the halting condition described, we choose to work in the sphere of natural language processing, an operational continuum increasingly based on machine learning. As a case study, we focus on parser generation, one of the most demanding and complex tasks in the domain. The selection of cross-validation as a canary function enables an actual comparison with the most representative early stopping conditions based on overfitting identification, pointing to a promising start toward an optimal bias and variance control.

    Funded for open access publication: Universidade de Vigo/CISUG
    info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-85160-C2-2-R/ES/AVANCES EN NUEVOS SISTEMAS DE EXTRACCION DE RESPUESTAS CON ANALISIS SEMANTICO Y APRENDIZAJE PROFUNDO
    info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113230RB-C22/ES/SEQUENCE LABELING MULTITASK MODELS FOR LINGUISTICALLY ENRICHED NER: SEMANTICS AND DOMAIN ADAPTATION (SCANNER-UVIGO)
    Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-2-R
    Agencia Estatal de Investigación | Ref. PID2020-113230RB-C22
    Xunta de Galicia | Ref. ED431C 2018/5
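
    The idea of combining several independent stopping conditions, rather than relying on a single criterion, can be illustrated with a deliberately simplified sketch in Python. This is not the formal method of the paper: the abstract describes correlating online indicators over time, whereas the code below just counts agreeing votes between two common heuristics (a generalization-loss threshold and a patience rule). All function names, thresholds and loss values are assumptions chosen for illustration.

```python
from typing import List, Optional


def gl_indicator(val_losses: List[float], alpha: float = 5.0) -> bool:
    """True when the latest validation loss exceeds the best one seen so far
    by more than `alpha` percent (a generalization-loss style criterion)."""
    best = min(val_losses)
    if best <= 0.0:
        return False  # relative increase is undefined; abstain
    return 100.0 * (val_losses[-1] / best - 1.0) > alpha


def patience_indicator(val_losses: List[float], patience: int = 3) -> bool:
    """True when none of the last `patience` epochs improved on the best
    validation loss recorded before them."""
    if len(val_losses) <= patience:
        return False
    return min(val_losses[-patience:]) >= min(val_losses[:-patience])


def should_stop(val_losses: List[float], min_votes: int = 2) -> bool:
    """Stop only when at least `min_votes` indicators agree (simplified
    stand-in for the correlation analysis described in the abstract)."""
    votes = [gl_indicator(val_losses), patience_indicator(val_losses)]
    return sum(votes) >= min_votes


def first_stop_epoch(val_losses: List[float]) -> Optional[int]:
    """Replay a loss history and return the 1-based epoch at which training
    would have been halted, or None if it never would."""
    for epoch in range(1, len(val_losses) + 1):
        if should_stop(val_losses[:epoch]):
            return epoch
    return None


# Hypothetical validation-loss history: improves, then starts to overfit.
HISTORY = [1.0, 0.8, 0.7, 0.72, 0.75, 0.80]
```

    On this history the generalization-loss indicator fires one epoch before the patience indicator does, so requiring both to agree delays the halt until the two assessments coincide. That is the kind of cross-checking between independent criteria the abstract argues for, in a much cruder form.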

    Towards robust word embeddings for noisy texts

    Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline models on noisy texts on a wide range of evaluation tasks, both intrinsic and extrinsic, while retaining good performance on standard texts. To the best of our knowledge, this is the first explicit approach to dealing with these types of noisy texts at the word embedding level that goes beyond the support for out-of-vocabulary words.

    European Social Fund | Ref. BES-2015-073768
    Xunta de Galicia | Ref. ED431D 2017/12
    Xunta de Galicia | Ref. ED431B 2017/01
    Xunta de Galicia | Ref. ED431G/01
    European Research Council | Ref. FASTPARSE, grant agreement No 71415